
Conversation

@msaroufim (Member) commented Jan 29, 2026

Don't bother reviewing!

Copilot AI review requested due to automatic review settings January 29, 2026 18:27
@github-actions

Coverage report

| File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing |
| --- | --- | --- | --- | --- | --- |
| src/libkernelbot/consts.py | | | | | |
| Project Total | | | | | |

This report was generated by python-coverage-comment-action

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a proof-of-concept Buildkite integration for running GPU kernel submissions on vendor-donated hardware. Buildkite provides better isolation guarantees compared to GitHub Actions by binding each agent to a single GPU with CPU/RAM limits enforced via systemd cgroups.

Changes:

  • Adds BuildkiteLauncher to create and monitor Buildkite builds via API (see the sketch after this list)
  • Implements buildkite-runner.py script that executes inside Docker containers on Buildkite agents
  • Provides infrastructure scripts for vendors to set up isolated GPU agents
  • Includes comprehensive documentation and testing guides
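
For orientation, a rough sketch of the create-and-poll flow described above (the endpoints are Buildkite's public REST API; the function name, payload fields, and polling interval are illustrative assumptions, not the PR's actual code):

```python
import time

import requests

BUILDKITE_API = "https://api.buildkite.com/v2"


def create_and_wait(org: str, pipeline: str, token: str, env: dict) -> dict:
    """Create a Buildkite build, then poll until it reaches a terminal state."""
    headers = {"Authorization": f"Bearer {token}"}
    builds_url = f"{BUILDKITE_API}/organizations/{org}/pipelines/{pipeline}/builds"
    resp = requests.post(
        builds_url,
        headers=headers,
        # "env" is how per-build variables (e.g. the GPU queue) reach the agent.
        json={"commit": "HEAD", "branch": "main", "env": env},
    )
    resp.raise_for_status()
    number = resp.json()["number"]

    while True:
        build = requests.get(f"{builds_url}/{number}", headers=headers).json()
        if build["state"] in ("passed", "failed", "canceled", "blocked"):
            return build
        time.sleep(5)
```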

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/libkernelbot/launchers/buildkite.py | New launcher that creates Buildkite builds, polls for completion, and downloads artifacts |
| src/runners/buildkite-runner.py | Runner script that decompresses payload and executes submissions in Buildkite agents |
| src/libkernelbot/consts.py | Adds BuildkiteGPU enum and BUILDKITE scheduler type |
| src/libkernelbot/launchers/__init__.py | Exports BuildkiteLauncher |
| src/kernelbot/main.py | Registers BuildkiteLauncher if API token is configured |
| src/kernelbot/env.py | Adds Buildkite environment variables (optional) |
| tests/test_buildkite.py | Unit tests for BuildkiteLauncher, GPU enums, and runner script |
| docker/kernelbot-runner/Dockerfile | Container image with CUDA, PyTorch, and kernelbot dependencies |
| docker/kernelbot-runner/requirements-runner.txt | Python dependencies for the runner image |
| .buildkite/pipeline.yml | Buildkite pipeline configuration for running submissions in Docker |
| .github/workflows/build-runner-image.yml | Workflow to automatically build and push runner image |
| scripts/buildkite/setup-agent.sh | Script for vendors to configure per-GPU agents with resource isolation |
| scripts/buildkite/update-image.sh | Script for vendors to update runner image and restart agents |
| docs/docs/vendor-onboarding/buildkite.md | Vendor onboarding guide with setup and troubleshooting instructions |
| docs/docs/vendor-onboarding/testing-guide.md | Comprehensive testing guide covering unit, container, agent, and E2E tests |
| docs/BUILDKITE_POC.md | POC documentation describing architecture, implementation, and next steps |
| src/discord_cluster_manager.egg-info/* | Auto-generated package metadata files |


Comment on lines +260 to +266
```python
profile_artifact = None
for artifact in artifacts:
    if artifact.get("filename") == "result.json":
        result_artifact = artifact
    elif artifact.get("path", "").startswith("profile_data/"):
        profile_artifact = artifact
```

Copilot AI commented Jan 29, 2026

The profile_artifact variable is set whenever an artifact's path starts with "profile_data/", so each match overwrites the previous one. If there are multiple profile artifacts (which is possible, since the pipeline config specifies "profile_data/**/*"), only the last match is kept and its download URL will be added to all ProfileResult objects. This could lead to incorrect profile data associations. Consider either collecting all profile artifacts or being more specific about which profile artifact to use.

Suggested change

```diff
-profile_artifact = None
-for artifact in artifacts:
-    if artifact.get("filename") == "result.json":
-        result_artifact = artifact
-    elif artifact.get("path", "").startswith("profile_data/"):
-        profile_artifact = artifact
+profile_artifacts = []
+for artifact in artifacts:
+    if artifact.get("filename") == "result.json":
+        result_artifact = artifact
+    elif artifact.get("path", "").startswith("profile_data/"):
+        profile_artifacts.append(artifact)
+
+# Determine which profile artifact to use, if any
+profile_artifact = None
+if len(profile_artifacts) == 1:
+    profile_artifact = profile_artifacts[0]
+elif len(profile_artifacts) > 1:
+    logger.warning(
+        "Multiple profile artifacts found; not associating a specific one with runs."
+    )
```

Comment on lines +27 to +37
```python
class BuildkiteGPU(Enum):
    # Queue naming: {vendor}-{gpu_type}
    # Buildkite agents use tags like queue=nvidia-h100-0 for per-GPU routing
    # The enum value is the queue prefix; agents append -N for specific GPU index
    NVIDIA_H100 = "nvidia-h100"
    NVIDIA_B200 = "nvidia-b200"
    NVIDIA_A100 = "nvidia-a100"
    AMD_MI300 = "amd-mi300"
    AMD_MI250 = "amd-mi250"
    GOOGLE_TPU = "google-tpu"
    NEBIUS_H100 = "nebius-h100"
```
Copilot AI commented Jan 29, 2026

The BuildkiteGPU enum includes several GPU types (NVIDIA_H100, NVIDIA_B200, NVIDIA_A100, AMD_MI300, AMD_MI250, GOOGLE_TPU, NEBIUS_H100), but the GPU_TO_SM mapping dictionary (lines 127-138 in consts.py) is missing entries for these new GPU types. When _get_arch is called with any of these new Buildkite GPU types, it will raise a KeyError. Add mappings for at least the NVIDIA GPUs: NVIDIA_H100 should map to "90a", NVIDIA_B200 to "100", NVIDIA_A100 to "80", and NEBIUS_H100 to "90a".
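
A minimal sketch of those additions (the key format is an assumption about how GPU_TO_SM is keyed in consts.py, which this diff does not show; the values are the arch strings suggested above):

```python
# Hypothetical additions to GPU_TO_SM; keys assumed to match the
# existing entries' naming, values are the suggested SM/arch strings.
GPU_TO_SM.update({
    "NVIDIA_H100": "90a",
    "NVIDIA_B200": "100",
    "NVIDIA_A100": "80",
    "NEBIUS_H100": "90a",
})
```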

```yaml
agents:
  # Route to agent with matching queue tag
  queue: "${GPU_QUEUE}"
timeout_in_minutes: 15
```
Copilot AI commented Jan 29, 2026

The pipeline timeout_in_minutes is hardcoded to 15 minutes, but BuildkiteLauncher calculates timeouts dynamically from the config (the get_timeout function) and adds TIMEOUT_BUFFER_MINUTES. For submissions with longer configured timeouts, the Buildkite pipeline will time out before the launcher's timeout logic can cancel the build. The pipeline timeout should either be set to a higher value or made configurable via an environment variable to align with the launcher's timeout calculation.

Suggested change

```diff
-timeout_in_minutes: 15
+# Pipeline timeout is configurable to align with BuildkiteLauncher timeouts.
+timeout_in_minutes: ${PIPELINE_TIMEOUT_MINUTES:-60}
```

```python
return {
    "lang": "py",
    "mode": "test",
    "files": {"main.py": "print('hello')"},
```
Copilot AI commented Jan 29, 2026

The test config uses "files" but should use "sources" to match the expected config format used by run_config. Looking at other tests in the codebase (e.g., tests/test_backend.py and tests/test_task.py), the config passed to launchers should have "sources" as the key containing file mappings, not "files". This will cause the runner script to fail when it calls run_config with this payload.

Suggested change

```diff
-    "files": {"main.py": "print('hello')"},
+    "sources": {"main.py": "print('hello')"},
```

Comment on lines +71 to +76
```python
self.token = token
self._headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}
```

Copilot AI commented Jan 29, 2026

The Buildkite API token is stored in plaintext in an instance variable. While this is consistent with how the GitHub launcher handles tokens, consider that tokens could be exposed in logs or error messages. Ensure that __repr__ or __str__ methods don't accidentally expose this token, and that error messages sanitize sensitive information.

Suggested change

```diff
-        self.token = token
-        self._headers = {
-            "Authorization": f"Bearer {token}",
-            "Content-Type": "application/json",
-        }
+        # Store the sensitive token in a private attribute to reduce accidental exposure.
+        self._token = token
+        self._headers = {
+            "Authorization": f"Bearer {token}",
+            "Content-Type": "application/json",
+        }
+
+    @property
+    def token(self) -> str:
+        """Read-only access to the Buildkite API token.
+
+        Kept as a property so external code can still access `launcher.token`,
+        while the actual storage remains in a private attribute.
+        """
+        return self._token
+
+    def __repr__(self) -> str:
+        """Representation of the launcher that avoids exposing the API token."""
+        return (
+            f"{self.__class__.__name__}(org={self.org!r}, "
+            f"pipeline={self.pipeline!r})"
+        )
```

Comment on lines 222 to 224
if state != "passed":
logger.warning(f"Build #{build_number} finished with state: {state}")
return
Copilot AI commented Jan 29, 2026

The _wait_for_completion method returns successfully for builds in states "failed", "canceled", or "blocked", not just "passed". This means the method will return normally even when the build didn't succeed, and then the code proceeds to download artifacts. A failed or canceled build may not have artifacts, which could lead to confusing error messages. Consider checking the state after _wait_for_completion returns and returning an appropriate error FullResult for non-passed states before attempting artifact download.

Suggested change

```diff
-if state != "passed":
-    logger.warning(f"Build #{build_number} finished with state: {state}")
-return
+if state == "passed":
+    return
+logger.warning(f"Build #{build_number} finished with state: {state}")
+raise RuntimeError(
+    f"Build #{build_number} finished with non-passed state: {state}"
+)
```

Comment on lines 31 to 46
```python
def main():
    # Get payload from environment variable
    payload_b64 = os.environ.get("SUBMISSION_PAYLOAD")
    if not payload_b64:
        raise RuntimeError("SUBMISSION_PAYLOAD environment variable not set")

    # Decompress and parse config
    payload = zlib.decompress(base64.b64decode(payload_b64)).decode("utf-8")
    config = json.loads(payload)

    # Run the submission
    result = run_config(config)

    # Write result to file for artifact upload
    result_dict = asdict(result)
    Path("result.json").write_text(json.dumps(result_dict, default=serialize))
```
Copilot AI commented Jan 29, 2026

The buildkite-runner.py main function does not wrap the entire execution in a try-except block. If run_config raises an exception (e.g., due to an invalid config or runtime error), the exception will propagate and no result.json will be created. This will cause the artifact download in BuildkiteLauncher to fail with a confusing "Could not find result.json artifact" error rather than showing the actual error. Consider wrapping the execution in try-except and writing an error result to result.json on failure.
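
One shape the suggested guard could take (a sketch reusing the names from the excerpt above; the fallback error payload is a hypothetical shape, not the project's FullResult schema):

```python
def main():
    payload_b64 = os.environ.get("SUBMISSION_PAYLOAD")
    if not payload_b64:
        raise RuntimeError("SUBMISSION_PAYLOAD environment variable not set")

    payload = zlib.decompress(base64.b64decode(payload_b64)).decode("utf-8")
    config = json.loads(payload)

    try:
        result_dict = asdict(run_config(config))
    except Exception as e:
        # Write an error result so the launcher still finds result.json
        # instead of failing with "Could not find result.json artifact".
        result_dict = {"success": False, "error": f"{type(e).__name__}: {e}"}
    Path("result.json").write_text(json.dumps(result_dict, default=serialize))
```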

```python
mock_download.return_value = MagicMock(success=True)

reporter = MockProgressReporter()
result = await launcher.run_submission(mock_config, gpu_type, reporter)
```
Copilot AI commented Jan 29, 2026

Variable result is not used.

Suggested change

```diff
 result = await launcher.run_submission(mock_config, gpu_type, reporter)
+assert result.success is True
```
